Exploiting Comparable Corpora with TER and TERp
Abstract
In this paper we present an extension of a successful, simple, and effective method for extracting parallel sentences from comparable corpora, and we apply it to an Arabic/English NIST system. We experiment with a new TERp filter, along with WER and TER filters. We also compare our approach with that of Munteanu and Marcu (2005) using exactly the same corpora and show a performance gain while using much less data. Our approach employs an SMT system built from small amounts of parallel text to translate the source side of the nonparallel corpus. The target-side texts are used, along with other corpora, in the language model of this SMT system. We then use information retrieval techniques and simple filters to create parallel data from comparable news corpora. We evaluate the quality of the extracted data by showing that it significantly improves the performance of an SMT system.
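The filtering step mentioned in the abstract can be illustrated with a minimal sketch. It covers only the WER variant: each candidate pairs an SMT translation of a source sentence with a target-side sentence found by information retrieval, and the pair is kept when the word error rate between the two is low enough. The function names, the tuple layout, the choice of the retrieved sentence as the reference side, and the threshold value of 0.5 are all assumptions made for illustration, not details taken from the paper. TER and TERp additionally allow block shifts (and, for TERp, stem, synonym, and paraphrase matches), which this sketch does not implement.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word tokens (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def wer(hypothesis, reference):
    """Word error rate: edit distance divided by the reference length."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    return word_edit_distance(hyp, ref) / len(ref)


def filter_pairs(candidates, threshold=0.5):
    """Keep (source, retrieved) pairs whose WER is at or below the threshold.

    `candidates` holds (source, smt_translation, retrieved_target) tuples.
    The threshold is a hypothetical value, not one reported in the paper.
    """
    kept = []
    for source, translation, retrieved in candidates:
        if wer(translation, retrieved) <= threshold:
            kept.append((source, retrieved))
    return kept
```

In this sketch a lower threshold yields fewer but cleaner sentence pairs, while a higher one admits more data at the cost of noisier pairs.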
Similar Resources
Using TERp to Augment the System Combination for SMT
TER-Plus (TERp) is an extended TER evaluation metric incorporating morphology, synonymy and paraphrases. There are three new edit operations in TERp: Stem Matches, Synonym Matches and Phrase Substitutions (Paraphrases). In this paper, we propose a TERp-based augmented system combination in terms of the backbone selection and consensus decoding network. Combining the new properties of the TERp, ...
Extracting Parallel Corpora from Comparable Documents to Improve Translation Quality in Machine Translation Systems
Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...
TERp System Description
This paper describes TER-Plus (TERp), the University of Maryland / BBN Technologies submission for the NIST Metric MATR 2008 workshop on automatic machine translation evaluation metrics. TERp is an extension of Translation Edit Rate (TER) that builds on the success of TER as an evaluation metric and alignment tool while addressing several of its weaknesses through the use of paraphrases, mor...
Partitioning of retained energy in broilers and birds with intermediate growth rate.
An experiment was conducted to study energy retained (TER) as fat (TERF) and protein (TERP) in 3 strains of birds with different growth rates: commercial broilers, Barred Plymouth Rock, and Leghorns. Birds were fed ad libitum a diet providing 3,100 kcal of AMEn/kg and 20% CP from 0 to 42 d. Body composition, TER, TERF, and TERP were determined at 0, 7, 10, 15, 19, 23, 28, 33, 37, and 42 d of age...
Exploiting the Leipzig Corpora Collection
In this paper the Leipzig Corpora Collection is introduced as a contribution to the idea that there is a need for standardization of multilingual language resources. We explain the steps of building, processing and presenting corpora of comparable sizes and in a uniform format. Results from intra- and interlingual comparisons of corpora are given and methods that can build upon these corpora
Publication date: 2009